Abstract

We explore the redundancy of parameters in deep neural networks by replacing the conventional linear projection in fully-connected layers with the circulant projection. The circulant structure substantially reduces the memory footprint and enables the use of the Fast Fourier Transform to speed up the computation. Considering a fully-connected neural network layer with $d$ input nodes and $d$ output nodes, this method improves the time complexity from $O(d^2)$ to $O(d \log d)$ and the space complexity from $O(d^2)$ to $O(d)$. The space savings are particularly important for modern deep convolutional neural network architectures, where fully-connected layers typically contain more than 90% of the network parameters. We further show that the gradient computation and optimization of the circulant projections can be performed very efficiently. Our experiments on three standard datasets show that the proposed approach achieves this significant gain in storage and efficiency with minimal increase in error rate compared to neural networks with unstructured projections.

1. Introduction 

Deep neural network-based methods have recently achieved dramatic accuracy improvements in many areas of computer vision, including image classification [45, 25], object detection, face recognition [38], and text recognition [2, 20].

These high-performing methods rely on deep networks containing millions or even billions of parameters. For example, the work by Krizhevsky et al. [23] achieved breakthrough results on the 2012 ImageNet challenge using a network containing 60 million parameters with five convolutional layers and three fully-connected layers. The top face verification results on the Labeled Faces in the Wild (LFW) dataset were obtained with networks containing hundreds of millions of parameters, using a mix of convolutional, locally-connected, and fully-connected layers. In architectures that rely only on fully-connected layers, the number of parameters can grow to billions.

* indicates equal contributions.

As larger neural networks are considered, with more layers and also more nodes in each layer, reducing their storage and computational costs becomes critical to meeting the requirements of practical applications. Current efforts towards this goal focus mostly on the optimization of convolutional layers, which consume the bulk of computational processing in modern convolutional architectures. We instead explore the redundancy of parameters in fully-connected layers, which are often the bottleneck in terms of memory consumption. Our solution relies on a simple and efficient approach based on circulant projections to significantly reduce the storage and computational costs of fully-connected neural network layers, while maintaining competitive error rates. Our work brings the following advantages:

• Fully-connected layers in modern convolutional architectures typically contain over 90% of the network parameters. Our approach significantly reduces the cost to store these networks in memory, which is crucial for GPUs or embedded systems with tight memory constraints.

• The proposed method enables FFT-based computation, which speeds up evaluation of fully connected layers. This is especially useful for neural networks with many fully connected layers, or consisting exclusively of fully connected layers.

• With far fewer parameters, our method is empirically shown to require less training data.

1.1. Overview of the proposed approach 

A basic computation in a fully-connected neural network layer is

$$h(\mathbf{x}) = \phi(\mathbf{R}\mathbf{x}), \tag{1}$$

where $\mathbf{R} \in \mathbb{R}^{k \times d}$ and $\phi(\cdot)$ is an element-wise nonlinear activation function. The above operation connects a layer with $d$ nodes and a layer with $k$ nodes. In convolutional neural networks, the fully connected layers are often used before the final softmax output layer, in order to capture global properties of the image. The computational complexity and space complexity of this linear projection are $O(dk)$. In practice, $k$ is usually comparable to or even larger than $d$. This leads to computation and space complexity of at least $O(d^2)$, creating a bottleneck for many neural network architectures.

In this work, we propose to impose a circulant structure on the projection matrix $\mathbf{R}$ in (1):¹

$$h(\mathbf{x}) = \phi(\mathbf{R}\mathbf{x}), \quad \mathbf{R} \text{ is a circulant matrix}. \tag{2}$$

This special structure dramatically reduces the number of parameters. It also allows us to use the Fast Fourier Transform (FFT) to speed up the computation. Considering a neural network layer with $d$ input nodes and $d$ output nodes, the proposed method reduces the space complexity from $O(d^2)$ to $O(d)$, and the time complexity from $O(d^2)$ to $O(d \log d)$. Table 1 compares the time and space complexity of the proposed approach with the conventional method.

Surprisingly, although the circulant matrix is highly structured with a very small number of parameters ($O(d)$), it captures the global information well, and does not impact the final performance much. We show empirically that our method can provide significant reduction of storage and computational costs while achieving very competitive error rates.

1.2. Organization 

Our work is organized as follows. We propose imposing the circulant structure on the linear projection matrix of fully-connected layers of neural networks to speed up computations and reduce storage costs in Section 3. We show a method which can efficiently optimize the neural network while keeping the circulant structure in Section 4. We demonstrate with experiments on visual data that the proposed method can speed up the computation and reduce memory needs while maintaining competitive error rates in Section 5. We begin by reviewing related work in the following section.

2. Related Work 
2.1. Deep Learning 

In the past few years, deep neural network methods have achieved impressive results in many visual recognition tasks. Recent advances in learning these models include the use of drop-out to prevent overfitting, more effective non-linear activation functions such as rectified linear units or max-out, and richer modeling through Network in Network (NiN) [25]. In particular, training high-dimensional networks with large quantities of training data is key to obtaining good results, but at the same time incurs increased computation and storage costs.

¹ A sign flipping operation is applied before the circulant projection matrix. We will present the formal framework in Section 3.1.

| Method | Time | Space | Time (Learning) |
|---|---|---|---|
| Conventional NN | $O(d^2)$ | $O(d^2)$ | $O(ntd^2)$ |
| Circulant NN | $O(d \log d)$ | $O(d)$ | $O(ntd \log d)$ |

Table 1. Comparison of the proposed method with neural networks based on unstructured projections. We assume a fully-connected layer, and the number of input nodes and number of output nodes are both $d$. $t$ is the number of gradient steps in optimizing the neural network.

2.2. Compressing Neural Networks 

The work of Collins and Kohli addresses the problem of memory usage in deep networks by applying sparsity-inducing regularizers during training to encourage zero-weight connections in the convolutional and fully-connected layers. Memory consumption is reduced only at test time, whereas our method cuts down storage costs at both training and testing times. Other approaches exploit low-rank matrix factorization [32, 6] to reduce the number of neural network parameters. In contrast, our approach exploits the redundancy in the parametrization of deep architectures by imposing a circulant structure on the projection matrix, reducing its storage to a single column vector, while allowing the use of FFT for faster computation.

Techniques based on knowledge distillation aim to compress the knowledge of a network with a large set of parameters into a compact and fast-to-execute network model. This can be achieved by training a compact model to imitate the soft outputs of a larger model. Romero et al. further show that the intermediate representations learned by the large model serve as hints to improve the training process and final performance of the compact model. In contrast, our work does not require the training of an auxiliary model.

Network in Network [25] has been recently proposed as a tool for richer local patch modeling in convolutional networks, where linear convolutions in each layer are replaced by convolving the input with a micro-network filter defined, for example, by a multi-layer perceptron. The inception architecture extends this work by using these micro-networks as dimensionality reduction modules to remove computational bottlenecks and reduce storage costs. A key differentiating aspect is that we focus on modeling global dependencies and reducing the cost of fully connected layers, which usually contain the large majority of parameters in standard configurations. Therefore, our work is complementary to these methods. Although [25] suggests that fully-connected layers could be replaced by average pooling without hurting performance for general image classification, other works in computer vision and speech recognition highlight the importance of these layers to capture global dependencies and achieve state-of-the-art results.

2.3. Speeding up Neural Networks 

Several recent methods have been proposed to speed up the computation of neural networks, with focus on convolutional architectures. Related to our work, Mathieu et al. [26] use the Fast Fourier Transform to accelerate the computation of convolutional layers, through the well-known convolution theorem. In contrast, our work is focused on the optimization of fully-connected layers by imposing circulant structure on the weight matrix to speed up the computation in both training and testing stages.

In the context of object detection, many techniques such as detector cascades or segmentation-based selective search have been proposed to reduce the number of candidate object locations in which a deep neural network is applied. Our proposed approach is complementary to these techniques.

Other approaches for speeding up neural networks rely on hardware-specific optimizations. For example, fast neural network implementations have been proposed for GPUs, CPUs, FPGAs, and on-chip implementations.

Our method is also related to the recent efforts around “shallow” neural networks, which show that sometimes shallow structures can match the performance of deep structures.

2.4. Linear Projection with Structured Matrices 

Structured matrices have been used in improving the space and computational complexities for different learning paradigms. For example, circulant matrices have been used in dimensionality reduction [41], binary embedding [43], and kernel approximation [44]. It has been shown that circulant structure can be used to save space and computation costs without performance degradation. The properties of circulant matrices have also been exploited to avoid expensive rounds of hard negative mining in training of object detectors and for real-time tracking.

One could in principle use other structured matrices such as Hadamard matrices along with a sparse random Gaussian matrix to achieve fast projection, as was done in the fast Johnson-Lindenstrauss transform, but they are slower than the circulant projection and need more space. We note that very recently, this idea has been studied in [42].


3. Circulant Neural Network Model 

In this section, we present the general framework of the circulant neural network model, showing the advantages of this model in achieving more efficient computational processing and storage cost savings.

3.1. Framework 

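A circulant matrix $\mathbf{R} \in \mathbb{R}^{d \times d}$ is a matrix defined by a vector $\mathbf{r} = (r_0, r_1, \cdots, r_{d-1})^T$:

$$\mathbf{R} = \operatorname{circ}(\mathbf{r}) := \begin{bmatrix} r_0 & r_{d-1} & \cdots & r_2 & r_1 \\ r_1 & r_0 & r_{d-1} & & r_2 \\ \vdots & r_1 & r_0 & \ddots & \vdots \\ r_{d-2} & & \ddots & \ddots & r_{d-1} \\ r_{d-1} & r_{d-2} & \cdots & r_1 & r_0 \end{bmatrix}. \tag{3}$$

Let $\mathbf{D}$ be a diagonal matrix with each diagonal entry being a Bernoulli variable ($\pm 1$ with probability 1/2). For $\mathbf{x} \in \mathbb{R}^d$, its $d$-dimensional output is:

$$h(\mathbf{x}) = \phi(\mathbf{R}\mathbf{D}\mathbf{x}), \quad \mathbf{R} = \operatorname{circ}(\mathbf{r}). \tag{4}$$

The projection with the matrix $\mathbf{D}$ corresponds to a random sign flipping step on the data.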
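The framework above can be illustrated with a small NumPy sketch that builds the circulant matrix explicitly. This is for illustration only and is not the implementation used in the experiments: the helper names `circ` and `layer_dense` are hypothetical, ReLU is assumed for $\phi$, and the convention that the first column of $\mathbf{R}$ equals $\mathbf{r}$ is assumed.

```python
import numpy as np

def circ(r):
    """Dense circulant matrix with first column r: R[i, j] = r[(i - j) mod d]."""
    d = len(r)
    idx = (np.arange(d)[:, None] - np.arange(d)[None, :]) % d
    return r[idx]

def layer_dense(r, signs, x):
    """h(x) = phi(R D x) with an explicit d x d matrix (phi = ReLU, an assumption)."""
    return np.maximum(circ(r) @ (signs * x), 0.0)

rng = np.random.default_rng(0)
d = 4
r = np.array([1.0, 2.0, 3.0, 4.0])
R = circ(r)                                  # rows: [1 4 3 2], [2 1 4 3], ...
signs = rng.choice([-1.0, 1.0], size=d)      # diagonal of D, +/-1 with prob. 1/2
x = rng.standard_normal(d)
h = layer_dense(r, signs, x)
```

Multiplying by $\mathbf{D}$ reduces to an element-wise product with a vector of signs, so $\mathbf{D}$ never needs to be stored as a matrix.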

3.2. Motivation 

The idea of replacing unstructured projection with circulant projection is motivated by the use of circulant projections in dimensionality reduction [41], binary embedding [43], and kernel approximation [44]. From the efficiency point of view, this structure creates great advantages in both space and computation (detailed in Section 3.4), and enables efficient optimization procedures (Section 4).

It is shown in previous works [41, 43, 44] that when the parameters of the circulant projection matrix are generated i.i.d. from the standard normal distribution, the circulant projection (with the random sign flipping matrix) mimics an unstructured randomized projection. In other words, randomized circulant projections can also be used in different frameworks to preserve pairwise $\ell_2$ distance, angle, and shift-invariant kernels. It is then reasonable to conjecture that randomized circulant projections can also achieve good performance in neural networks (compared to using unstructured randomized matrices). This is indeed true, as we will demonstrate empirically. And similar to binary embedding and kernel approximation, by optimizing the parameters of the projection matrix, we can significantly improve the performance.

3.3. The Need for Matrix D 

This matrix is required in prior work [41, 43, 44]. By adding the random sign flipping matrix, the resulting projections are less correlated. In practice, the performance of a circulant neural network drops when the random sign flipping is not performed, which we demonstrate in Section 5.5. To simplify the notation, we omit the matrix $\mathbf{D}$ in the following sections.

3.4. Space and Time Efficiency 

The two main advantages of the circulant projection are superior space and time efficiency.

Space Efficiency. Typically, over 90% of the storage cost of convolutional neural networks is due to the fully connected layers, so “compressing” such layers is a very important task. As the proposed circulant projection contains only $2d$ parameters ($d$ floats for $\mathbf{R}$ and $d$ booleans for $\mathbf{D}$),² the space complexity is $O(d)$. This is a significant advantage compared to the conventional fully connected layer, which requires $O(d^2)$ parameters. For example, when $d = 4096$, as in “AlexNet” [23], the proposed method can decrease memory requirements by a factor of thousands, making the most space consuming component (the fully connected layer) negligible in memory cost. We will show in Section 5 that, surprisingly, the circulant neural network can have very competitive performance with such dramatic space savings. The small number of parameters also makes the neural network perform better with a limited amount of training data. In addition, we can further improve the performance (yet maintain significant space savings) by adding more nodes (“fatter” layers) and more layers.
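The "factor of thousands" for $d = 4096$ can be checked with a few lines of arithmetic. The storage sizes below are assumptions made only for this illustration (32-bit floats, one bit per boolean sign) and are not stated in the text.

```python
d = 4096  # layer width from the AlexNet example in the text

dense_params = d * d            # unstructured d x d projection matrix
circulant_floats = d            # the vector r that defines R
sign_bits = d                   # diagonal of D, one boolean per entry

# Ratios in terms of parameter counts.
ratio_counting_signs = dense_params / (circulant_floats + sign_bits)
ratio_floats_only = dense_params / circulant_floats

# Ratios in terms of bytes, assuming 4-byte floats and bit-packed signs.
dense_bytes = dense_params * 4
circulant_bytes = circulant_floats * 4 + sign_bits // 8
dense_mb = dense_bytes / 2**20
circulant_kb = circulant_bytes / 2**10
memory_ratio = dense_bytes / circulant_bytes

print(dense_params, ratio_counting_signs, ratio_floats_only)
print(dense_mb, circulant_kb, memory_ratio)
```

Under these assumptions, one dense layer takes 64 MB while the circulant layer takes 16.5 KB, a ratio of roughly 4000.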

Time Efficiency. The structure enables the use of the Fast Fourier Transform (FFT) to speed up the computation. For $d$-dimensional data, the 1-layer circulant neural network has time complexity $O(d \log d)$. Next we explain how we achieve this time complexity. Given a data point $\mathbf{x}$, $h(\mathbf{x})$ can be efficiently computed as follows. Denote by $\circledast$ the operator of circulant convolution. Based on the definition of a circulant matrix,

$$\mathbf{R}\mathbf{x} = \mathbf{r} \circledast \mathbf{x}. \tag{5}$$

The convolution above can be computed more efficiently in the Fourier domain, using the Discrete Fourier Transform (DFT), for which a fast algorithm (FFT) is available:

$$h(\mathbf{x}) = \phi\left(\mathcal{F}^{-1}\left(\mathcal{F}(\mathbf{r}) \circ \mathcal{F}(\mathbf{x})\right)\right), \tag{6}$$

where $\mathcal{F}(\cdot)$ is the DFT operator, $\mathcal{F}^{-1}(\cdot)$ is the inverse DFT (IDFT) operator, and $\circ$ denotes the element-wise product. As the DFT and IDFT can be efficiently computed in $O(d \log d)$ with the FFT [29], the proposed approach has time complexity $O(d \log d)$.
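Equations (5) and (6) can be sketched in NumPy as follows. This is a minimal illustration rather than the CUDA FFT implementation used in the experiments: the function names are hypothetical, ReLU is assumed for $\phi$, and the explicit matrix is built only to check the FFT result.

```python
import numpy as np

def circulant_project(r, x):
    """R x for R = circ(r), computed as a circular convolution via the FFT."""
    return np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)).real

def circulant_layer(r, signs, x):
    """h(x) = phi(R D x) in O(d log d); phi = ReLU is an illustrative choice."""
    return np.maximum(circulant_project(r, signs * x), 0.0)

rng = np.random.default_rng(0)
d = 8
r = rng.standard_normal(d)
x = rng.standard_normal(d)
signs = rng.choice([-1.0, 1.0], size=d)

# Reference: explicit d x d circulant matrix with first column r.
idx = (np.arange(d)[:, None] - np.arange(d)[None, :]) % d
R = r[idx]
fast = circulant_project(r, x)
slow = R @ x
```

Only `r` and `signs` are stored, both of length $d$. For real inputs, `np.fft.rfft` and `np.fft.irfft` would roughly halve the work; the complex transforms are used here to stay close to (6).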

3.5. When k ≠ d

We have so far assumed the number of nodes in the input layer $d$ to be equal to the number of nodes in the output layer.³ In this section, we provide extensions to handle the case where $k \neq d$.

² Note that the circulant matrix is never explicitly computed or stored.

When $k < d$, the fully connected layer performs a “compression” of the signal. In this case, we still use the circulant matrix $\mathbf{R} \in \mathbb{R}^{d \times d}$ with $d$ parameters, but the output is set to be the first $k$ elements of $h(\mathbf{x})$. The circulant neural network is not computationally more efficient in this situation compared to when $k = d$.

When $k > d$, the fully connected layer performs an “expansion” of the signal. In this case, the simplest solution is to use multiple circulant projections and concatenate their outputs. This has computational complexity $O(k \log d)$ and space complexity $O(k)$. Note that the DFT of the feature vector can be reused in this case. An alternative approach is to extend every feature vector to a $k$-dimensional vector by appending $k - d$ zeros. This returns the problem to the previously described setting, with $d$ replaced by $k$. This gives space complexity $O(k)$ and computational complexity $O(k \log k)$. In practice, $k$ is usually at most a few times larger than $d$. Empirically the two approaches take similar computational time. Our experimental results are based on the second approach.
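The three cases above (truncation for $k < d$, concatenation or zero-padding for $k > d$) can be sketched as follows. The function names are hypothetical, and the concatenation variant assumes for simplicity that $k$ is a multiple of $d$, which the text does not require.

```python
import numpy as np

def circ_conv(r, x):
    """Circular convolution r (*) x via the FFT, i.e. circ(r) @ x."""
    return np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)).real

def project_compress(r, x, k):
    """k < d: compute the full d-dimensional projection, keep the first k entries."""
    return circ_conv(r, x)[:k]

def project_expand_concat(rs, x):
    """k > d, first option: concatenate several d-dimensional circulant projections.
    The FFT of x is computed once and reused for every r in rs."""
    fx = np.fft.fft(x)
    return np.concatenate([np.fft.ifft(np.fft.fft(r) * fx).real for r in rs])

def project_expand_pad(r_k, x):
    """k > d, second option: zero-pad x to k = len(r_k) entries, then project."""
    x_padded = np.concatenate([x, np.zeros(len(r_k) - len(x))])
    return circ_conv(r_k, x_padded)

rng = np.random.default_rng(0)
d = 6
x = rng.standard_normal(d)
r = rng.standard_normal(d)
rs = [rng.standard_normal(d) for _ in range(3)]   # k = 3d = 18 outputs
r_k = rng.standard_normal(15)                     # k = 15 outputs

compressed = project_compress(r, x, 4)
expanded_concat = project_expand_concat(rs, x)
expanded_pad = project_expand_pad(r_k, x)
```

The truncation variant still performs a length-$d$ FFT, which is why it is no cheaper than the $k = d$ case.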

4. Training Circulant Neural Networks 

In this section, we propose a highly efficient way of training circulant neural networks. We also discuss a special type of circulant neural network, where the parameters of the circulant matrix are randomized instead of optimized.

4.1. Gradient Computation 

The most critical step for optimizing a neural network given a training set is to compute the gradient of the error function with respect to the network weights. Let us consider the conventional neural network with two layers, where the first layer computes a linear transformation followed by a nonlinear activation function:

$$h(\mathbf{x}) = \phi(\mathbf{R}\mathbf{x}), \tag{7}$$

where $\mathbf{R}$ is an unstructured matrix. We assume the second layer is a linear classifier with weights $\mathbf{w}$. Therefore the output of the two-layer neural network is

$$J(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{R}\mathbf{x}). \tag{8}$$

When training the neural network, computing the gradient of the error function involves computing the gradient of $J(\mathbf{x})$ with respect to each entry of $\mathbf{R}$. It is easy to show that

$$\frac{\partial J(\mathbf{x})}{\partial R_{ij}} = w_i \phi'(\mathbf{R}_{i\cdot}\mathbf{x}) x_j, \tag{9}$$

$$i = 0, \cdots, d-1, \quad j = 0, \cdots, d-1, \tag{10}$$

where $\mathbf{R}_{i\cdot}$ denotes the $i$-th row of $\mathbf{R}$ and $\phi'(\cdot)$ is the derivative of $\phi(\cdot)$.

³ Such a setting is commonly used in fully connected layers of recent convolutional neural network architectures.

Note that (9) suffices for the gradient-based optimization of neural networks, as the gradient w.r.t. networks with more layers can simply be computed with the chain rule, leading to the well-known “back-propagation” scheme.
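The entry-wise gradient of the unstructured layer is the outer product of $\mathbf{w} \circ \phi'(\mathbf{R}\mathbf{x})$ and $\mathbf{x}$. The sketch below checks this against central finite differences; the smooth activation $\phi = \tanh$ and the function names are assumptions made for the illustration only.

```python
import numpy as np

def J(R, w, x):
    """J(x) = w^T phi(R x), with phi = tanh as an illustrative smooth activation."""
    return w @ np.tanh(R @ x)

def grad_R(R, w, x):
    """dJ/dR_ij = w_i * phi'(R_i. x) * x_j, assembled as an outer product."""
    dphi = 1.0 - np.tanh(R @ x) ** 2
    return np.outer(w * dphi, x)

rng = np.random.default_rng(0)
d = 5
R = rng.standard_normal((d, d))
w = rng.standard_normal(d)
x = rng.standard_normal(d)

# Central finite differences as an independent check of the formula.
eps = 1e-6
numeric = np.zeros_like(R)
for i in range(d):
    for j in range(d):
        E = np.zeros_like(R)
        E[i, j] = eps
        numeric[i, j] = (J(R + E, w, x) - J(R - E, w, x)) / (2 * eps)
analytic = grad_R(R, w, x)
```

Storing and computing this gradient costs $O(d^2)$, which matches the learning-time entry for the conventional network in Table 1.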

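In the circulant case, we need to compute the gradient of the following objective function:

$$J(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{R}\mathbf{x}) = \sum_{i=0}^{d-1} w_i \phi(\mathbf{R}_{i\cdot}\mathbf{x}), \quad \mathbf{R} = \operatorname{circ}(\mathbf{r}). \tag{11}$$

It is easy to show that

$$\frac{\partial J(\mathbf{x})}{\partial r_i} = \mathbf{w}^T \left(\phi'(\mathbf{R}\mathbf{x}) \circ s_{\to i}(\mathbf{x})\right) \tag{12}$$

$$= s_{\to i}(\mathbf{x})^T \left(\mathbf{w} \circ \phi'(\mathbf{R}\mathbf{x})\right), \tag{13}$$

where $s_{\to i}(\cdot): \mathbb{R}^d \to \mathbb{R}^d$ right (downwards for a column vector) circularly shifts the vector by $i$ elements. Therefore,

$$\begin{aligned} \nabla_{\mathbf{r}} J(\mathbf{x}) &= [s_{\to 0}(\mathbf{x}), s_{\to 1}(\mathbf{x}), \cdots, s_{\to (d-1)}(\mathbf{x})]^T \left(\mathbf{w} \circ \phi'(\mathbf{R}\mathbf{x})\right) \\ &= \operatorname{circ}\left(s_{\to 1}(\operatorname{rev}(\mathbf{x}))\right) \left(\mathbf{w} \circ \phi'(\mathbf{R}\mathbf{x})\right) \\ &= s_{\to 1}(\operatorname{rev}(\mathbf{x})) \circledast \left(\mathbf{w} \circ \phi'(\mathbf{r} \circledast \mathbf{x})\right), \end{aligned} \tag{14}$$

where

$$\operatorname{rev}(\mathbf{x}) = (x_{d-1}, x_{d-2}, \cdots, x_0)^T, \qquad s_{\to 1}(\operatorname{rev}(\mathbf{x})) = (x_0, x_{d-1}, x_{d-2}, \cdots, x_1)^T.$$

The above uses the same trick of converting the circulant matrix multiplication to circulant convolution. Therefore, computing the gradient takes only $O(d \log d)$ time with the FFT. Training a multi-layer neural network is nothing more than applying (13) in each layer with the chain rule.

Note that when $k < d$, we can simply set the last $d - k$ entries of $\mathbf{w}$ to be zero. And when $k > d$, the above derivations can be applied with minimal changes.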
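The convolution form of the gradient with respect to $\mathbf{r}$ can be checked numerically. In this sketch the activation $\phi = \tanh$ and the function names are assumptions; `np.roll(x[::-1], 1)` plays the role of $s_{\to 1}(\operatorname{rev}(\mathbf{x}))$.

```python
import numpy as np

def circ_conv(a, b):
    """Circular convolution a (*) b via the FFT."""
    return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

def J(r, w, x):
    """J(x) = w^T phi(r (*) x), with phi = tanh as an illustrative choice."""
    return w @ np.tanh(circ_conv(r, x))

def grad_r(r, w, x):
    """Gradient of J w.r.t. r in O(d log d), using the convolution form."""
    g = w * (1.0 - np.tanh(circ_conv(r, x)) ** 2)   # w o phi'(r (*) x)
    x_flipped = np.roll(x[::-1], 1)                 # (x_0, x_{d-1}, ..., x_1)
    return circ_conv(x_flipped, g)

rng = np.random.default_rng(0)
d = 8
r, w, x = (rng.standard_normal(d) for _ in range(3))

# Central finite differences over the d entries of r as an independent check.
eps = 1e-6
numeric = np.zeros(d)
for i in range(d):
    e = np.zeros(d)
    e[i] = eps
    numeric[i] = (J(r + e, w, x) - J(r - e, w, x)) / (2 * eps)
analytic = grad_r(r, w, x)
```

The gradient has only $d$ entries, so both the forward and the backward pass stay within $O(d \log d)$ time and $O(d)$ memory.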

4.2. Randomized Circulant Neural Networks 

We also consider the case where the elements of $\mathbf{r}$ are generated independently from a standard normal distribution $\mathcal{N}(0, 1)$. We refer to these models as randomized circulant neural networks. In this case, the parameters of the circulant projections are defined by random weights, without optimization. In other words, in the optimization process, only the parameters of convolutional layers and the softmax classification layer are optimized. This setting is interesting to study as it provides insight into the “capacity” of the model, independent of specific optimization mechanisms.

We will show empirically that compared to unstructured randomized neural networks, the circulant neural network is faster with the same number of nodes, while achieving similar performance. This surprising result is in line with recent theoretical/empirical discoveries around using circulant projections for dimensionality reduction and binary embedding [43]. It has been shown that the circulant projection behaves similarly to fully randomized projections in terms of the distance preserving properties. In other words, the randomized circulant projection can be seen as a simulation of the unstructured randomized projection, both of which can capture global properties of the data.
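The distance-preserving behaviour mentioned above can be illustrated with a single numerical sanity check (not a proof, and not one of the reported experiments). The $1/\sqrt{d}$ scaling, the test vectors, and the function name are assumptions introduced for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096
r = rng.standard_normal(d)                  # r_i ~ N(0, 1), fixed rather than trained
signs = rng.choice([-1.0, 1.0], size=d)     # diagonal of the sign flipping matrix D

def project(x):
    """(1 / sqrt(d)) R D x via the FFT; the scaling is our normalization choice."""
    y = np.fft.ifft(np.fft.fft(r) * np.fft.fft(signs * x)).real
    return y / np.sqrt(d)

a = rng.standard_normal(d)
b = rng.standard_normal(d)

# Ratio of the distance after projection to the distance before projection.
ratio = np.linalg.norm(project(a) - project(b)) / np.linalg.norm(a - b)
print(ratio)
```

With this normalization the ratio is expected to be close to 1 for large $d$, which is the sense in which the randomized circulant projection mimics an unstructured Gaussian projection.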

In addition, we will show that with the optimizations described in Section 4.1, the error rate of the neural networks improves significantly over the randomized version, meaning that the circulant structure is flexible and powerful enough to be used in a data-dependent fashion.

5. Experiments 

We apply our model to three standard datasets in our experiments: MNIST, CIFAR-10, and ImageNet. We note that it is not our goal to obtain state-of-the-art results on these datasets, but rather to provide a fair analysis of the effectiveness of circulant projections in the context of deep neural networks, compared to unstructured projections.

Next we describe our implementation and analysis of accuracy and storage costs on these three datasets, followed by an experiment on reduced training set size.

5.1. Experiments on MNIST 

The MNIST digit dataset contains 60,000 training and 10,000 test images of ten handwritten digits (0 to 9), with 28 × 28 pixels. We use the LeNet network [24] as our basic CNN model, which is known to work well on digit classification tasks. LeNet consists of a convolutional layer followed by a pooling layer, another convolutional layer followed by a pooling layer, and then two fully connected layers similar to conventional multilayer perceptrons. We used a slightly different version from the original LeNet implementation, where the sigmoid activations are replaced by Rectified Linear Unit (ReLU) activations for the neurons.

Our implementation extends Caffe [22] by replacing the weight matrix of the fully connected layers with the proposed circulant projections of the same dimensionality. The results are compared and shown in Table 2. Our fast circulant neural network achieves an error rate of 0.95% on the full MNIST test set, which is very competitive with the 0.92% error rate from the conventional neural network. At the same time, the circulant LeNet is 5.7x more space efficient and 1.43x more time efficient than LeNet.

5.2. Experiments on CIFAR

CIFAR-10 is a dataset of natural 32x32 RGB images covering 10 classes, with 50,000 images for training and 10,000 for testing. Images in CIFAR-10 vary significantly not only in object position and object scale within each class, but also in object colors and textures.

| Method | Train Error | Test Error | Memory (MB) | Testing Time (sec.) |
|---|---|---|---|---|
| LeNet | 0.35% | 0.92% | 1.56 | 3.06 |
| Circulant LeNet | 0.47% | 0.95% | 0.27 | 2.14 |

Table 2. Experimental results on MNIST.

| Method | Train Error | Test Error | Memory (MB) | Testing Time (sec.) |
|---|---|---|---|---|
| CIFAR-10 CNN | 4.45% | 15.60% | 0.45 | 4.56 |
| Circulant CIFAR-10 CNN | 6.57% | 16.71% | 0.12 | 3.92 |

Table 3. Experimental results on CIFAR-10.

The CIFAR-10 CNN network used in our test consists of 3 convolutional layers, 1 fully-connected layer and 1 softmax layer. Rectified linear units (ReLU) are used as the activation units. The circulant CIFAR-10 CNN is implemented by using a circulant weight matrix to replace the fully connected layer. Images are cropped to 24x24 and augmented with horizontal flips, rotation, and scaling transformations. We use an initial learning rate of 0.001 and train for 700-300-50 epochs with their default weight decay.

A comparison of the error rates obtained by circulant and unstructured projections is shown in Table 3. Our efficient approach based on circulant networks obtains test error of 16.71% on this dataset, compared to 15.60% obtained by the conventional model. At the same time, the circulant network is 4x more space efficient and 1.2x more time efficient than the conventional CNN.

5.3. Experiments on ImageNet (ILSVRC-2010) 

ImageNet is a dataset containing over 15 million labeled high-resolution images belonging to roughly 22,000 categories. Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held. A subset of ImageNet with roughly 1000 images in each of 1000 considered categories is used in this challenge. Our experiments were performed on ILSVRC-2010.

We use a standard CNN network, “AlexNet” [23], as the building block. AlexNet consists of 5 convolutional layers, 2 fully-connected layers and 1 final softmax layer. Rectified linear units (ReLU) are used as the activation units. Pooling layers and response normalization layers are also used between convolutional layers. Our circulant network version involves three components: 1) a feature extractor, 2) fully circulant layers, and 3) a softmax classification layer. For 1 and 3 we utilize the Caffe package [22]. For 2, we implement it with CUDA FFT.

All models are trained using mini-batch stochastic gradient descent (SGD) with momentum on batches of 128 images, with the momentum parameter fixed at 0.9. We set the initial learning rate to 0.01, and manually decrease the learning rate if the network stops improving as in [23], according to a schedule determined on a validation set. Dataset augmentation is also exploited.

Table 4 shows the error rate of various models. We have used two types of structures for the proposed method. Circulant CNN 1 replaces the fully connected layers of AlexNet with circulant layers. Circulant CNN 2 uses “fatter” circulant layers compared to Circulant CNN 1: $d$ of Circulant CNN 2 is set to be 2^^^. In “Reduced AlexNet”, we reduce the parameter size on the fully-connected layer of the original AlexNet to a size similar to our Circulant CNN by cutting $d$. We have the following observations.

• The performance of Randomized Circulant CNN 1⁴ is very competitive with Randomized AlexNet. This is expected as the circulant projection closely simulates a fully randomized projection (Section 4.2).

• Optimization significantly improves the performance for both unstructured projections and circulant projections. The performance of Circulant CNN 1 is very competitive with AlexNet, yet with a fraction of the space cost.

• By tweaking the structure to include more parameters, Circulant CNN 2 further drops the error rate to 17.8%, yet it takes only a marginally larger amount of space compared to Circulant CNN 1, an 18x space saving compared to AlexNet.

• With the same memory cost, the Reduced AlexNet performs much worse than Circulant CNN 1.

In addition, one interesting finding is that “dropout”, which is widely used in training CNNs, does not improve the performance of circulant neural networks. In fact it increases the error rate from 19.4% (without dropout) to 20.3% (not shown in the figure). This indicates that the proposed method is more immune to over-fitting. We also show the training time (per image) on the standard and circulant versions of AlexNet. We vary the number of hidden nodes $d$ in the fully connected layers and compare the training time until the model converges (ms per image). Table 5 shows the result. Our method provides dramatic space savings and significant speedup compared to the conventional approach.

⁴ This is Circulant CNN 1 with randomized circulant projections. In other words, only the convolutional layer is optimized.

| Method | Top-5 Error | Top-1 Error | Memory (MB) |
|---|---|---|---|
| Randomized AlexNet | 33.5% | 61.7% | 233.2 |
| Randomized Circulant CNN 1 | 35.2% | 62.8% | 20.5 |
| AlexNet | 17.1% | 42.8% | 233.2 |
| Circulant CNN 1 | 19.4% | 44.1% | 20.5 |
| Circulant CNN 2 | 17.8% | 43.2% | 20.7 |
| Reduced-AlexNet | 37.2% | 65.3% | 20.7 |

Table 4. Classification error rate and memory cost on ILSVRC-2010.

| $d$ | Full projection | Circulant projection | Speedup | Space Saving (in a fully connected layer) |
|---|---|---|---|---|
| $2^{10}$ | 2.97 | 2.52 | 1.18x | 1,000x |
| $2^{12}$ | 3.84 | 2.79 | 1.38x | 4,000x |
| $2^{15}$ | 19.5 | 5.43 | 3.60x | 30,000x |

Table 5. Comparison of training time (ms per image) and space of full projection and circulant projection. Speedup is defined as the time of the unstructured projection divided by the time of the circulant projection. Space saving is defined as the space of storing the unstructured matrix divided by the space of storing the circulant model. The unstructured projection matrix in conventional neural networks takes more than 90% of the space cost. In AlexNet, $d$ is $2^{12}$.
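The speedup column of Table 5 can be re-derived from the two timing columns, taking speedup as full-projection time divided by circulant-projection time. The snippet only restates the table's numbers; the variable names are ours.

```python
# Training time in ms per image, copied from Table 5: (full, circulant).
times = [(2.97, 2.52), (3.84, 2.79), (19.5, 5.43)]

# Speedup = time of the full projection / time of the circulant projection.
speedups = [round(full / circulant, 2) for full, circulant in times]
print(speedups)
```

The first two values agree with the table (1.18x and 1.38x). The third comes out as 3.59 from the rounded timings, against the reported 3.60x, a difference that rounding of the timings would account for.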


5.4. Reduced Training Set Size 

Compared to the neural network model with unstructured projections, the circulant neural network has fewer parameters. Intuitively, this may bring the benefit of better model generalization. In other words, the circulant neural network might be less data hungry compared to conventional neural networks. To verify our assumption, we report the performance of each model when trained with different training set sizes on the MNIST, CIFAR-10, and ILSVRC-2010 datasets. Figure 1 shows test error rate when training on a random subset of the training data. On MNIST and ILSVRC-2010, to achieve a fixed error rate, the circulant models need less data. On CIFAR-10, this improvement is limited as the circulant layer only occupies a small part of the model.


5.5. Results Without D 


As noted in Section 3.3, the sign flipping matrix $\mathbf{D}$ in (4) is important in our formulation. We provide some empirical results in this section. On the MNIST dataset, the test error rate increases from 0.95% to 2.45% by dropping $\mathbf{D}$. On the CIFAR-10 dataset, the test error rate increases from 16.71% to 21.33% by dropping $\mathbf{D}$.


6. Discussion 

6.1. Fully Connected Layer vs. Convolution Layer 

The goal of the method developed in this paper is to improve the efficiency of the fully connected layers of neural networks. In convolutional architectures, the fully connected layers are often the bottleneck in terms of the space cost. For example, in “AlexNet”, the fully connected layers take 95% of the storage. Remarkably, the proposed method enables dramatic space savings in the fully connected layer (4000x as shown in Table 5), making it negligible in memory cost compared to the convolutional layers. Our discovery resonates with recent work showing that the fully connected layers can be compressed, or even completely removed.

In addition, the fully connected layer costs roughly 20%-30% of the computation time based on our implementation. The FFT-based implementation can further improve the time cost, though not to the same degree as the realized space savings, if the majority of layers are convolutional. Our method is complementary to the work on improving the time and space cost of convolutional layers.

6.2. Circulant Projection vs. 2D Convolution 

Figure 1. Test error when training with reduced dataset sizes of circulant CNN and conventional CNN.

One may notice that, although our approach leverages convolutions for speeding up computations, it is fundamentally different from the convolutions performed in CNNs. The convolution filters in CNNs are all small 2D filters aiming at capturing local information of the images, whereas the proposed method is used to replace the fully connected layers, which are often “big” layers capturing global information. The operation involved is large 1D convolution rather than small 2D convolution. The circulant projection can be understood as “simulating” an unstructured projection, with much less cost. Note that one can also apply FFT to compute the convolutions on the 2D convolutional layers, but due to the computational overhead, the speed improvement is generally limited on small-scale problems. In contrast, our method can be used to dramatically speed up and scale the processing in fully connected layers. For instance, when the number of input nodes and output nodes are both 1 million, conventional linear projection is essentially impossible, as it requires TBs of memory. On the other hand, doing a convolution of two 1 million dimensional vectors requires only MBs of memory and tens of milliseconds.
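The memory figures in the 1-million-node example follow from simple arithmetic. The 4-byte float size is an assumption made for this illustration; the text does not state the precision used.

```python
d = 1_000_000
bytes_per_float = 4  # assumption: 32-bit floats

dense_bytes = d * d * bytes_per_float      # unstructured d x d matrix
circulant_bytes = d * bytes_per_float      # one length-d vector r

dense_tb = dense_bytes / 1e12
circulant_mb = circulant_bytes / 1e6
print(dense_tb, circulant_mb)
```

Under this assumption the dense matrix needs 4 TB, while the vector defining the circulant projection needs 4 MB.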

6.3. Towards Larger Neural Networks 

Currently, deep neural network models usually contain hundreds of millions of parameters. In real-world applications, there exist problems which involve an increasing amount of data. We may need larger and deeper networks to learn better representations from large amounts of data. Compared to unstructured projections, the circulant projection significantly reduces computation and storage costs. Therefore, with the same amount of resources, circulant neural networks can use deeper as well as larger fully-connected networks. We have conducted preliminary experiments showing that the circulant model can be extended at least 10x deeper than conventional neural networks with the same scale of computational resources.

7. Conclusions 

We proposed to use circulant projections to replace the unstructured projections in order to optimize fully connected layers of neural networks. This dramatically improves the computational complexity from $O(d^2)$ to $O(d \log d)$ and space complexity from $O(d^2)$ to $O(d)$. An efficient approach was proposed for optimizing the parameters of the circulant projections. We demonstrated empirically that this optimization can lead to much faster convergence and training time compared to conventional neural networks. Our experimental analysis was carried out on three standard datasets, showing the effectiveness of the proposed approach. We also reported experiments on randomized circulant projections, achieving performance similar to that of unstructured randomized projections. Our ongoing work is to explore different matrix structures to compress and speed up neural networks.